L4: Data Visualization

Bogdan G. Popescu

John Cabot University

Visualizing Data

  • R has several systems for making graphs
  • ggplot2 is a quick and reliable package in R for making graphs
  • You should first load the packages ggplot2 and ggthemes in the following way:
library("ggplot2")
library("ggthemes")

If you get an error indicating that a package does not exist, install it first by running the lines below without the leading #:

#install.packages("ggplot2")
#install.packages("ggthemes")
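
A minimal sketch of checking which packages are missing before installing anything. The helper name missing_packages is hypothetical (it is not part of the slides); requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: returns the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "ggthemes")
to_install <- missing_packages(needed)

# Install only what is missing (left commented out, as in the slides)
# if (length(to_install) > 0) install.packages(to_install)
```

The installation line stays commented out so that running the snippet never installs anything by accident.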

Example

  • We can start off with an example, examining the relationship between urbanization and life expectancy

  • The research question is: Do countries with higher levels of urbanization also have higher life expectancy?

  • What does the relationship between urbanization and life expectancy look like? Positive? Negative? Nonlinear?

  • Does the relationship vary by continent?

The data

setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_exp_urb <- read.csv(file = './life_exp_urb.csv')
#Step2: Examining the first five entries
head(life_exp_urb, n=5)
       Entity life_expectancy urbanization            type
1       Aruba        70.25972     48.05939 Everything Else
2 Afghanistan        45.38333     18.61175 Everything Else
3      Angola        45.08466     37.53970 Everything Else
4    Anguilla        69.44028           NA Everything Else
5     Albania        68.28611     40.44416 Everything Else

The data

head(life_exp_urb, n=3)
       Entity life_expectancy urbanization            type
1       Aruba        70.25972     48.05939 Everything Else
2 Afghanistan        45.38333     18.61175 Everything Else
3      Angola        45.08466     37.53970 Everything Else
  • The dataframe shows us:
    • Variables such as: Entity, life_expectancy, urbanization, and type
    • Values - the state of a variable when we measure it, e.g.: 70.25972, 45.38333
    • Observations or data points - sets of values measured on the same unit, each value associated with a different variable, e.g. the Aruba observation, the Afghanistan observation

The data

A good way to examine the dataframe is

head(life_exp_urb, n=3)
       Entity life_expectancy urbanization            type
1       Aruba        70.25972     48.05939 Everything Else
2 Afghanistan        45.38333     18.61175 Everything Else
3      Angola        45.08466     37.53970 Everything Else

or

library(dplyr)
glimpse(life_exp_urb)
Rows: 237
Columns: 4
$ Entity          <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania…
$ life_expectancy <dbl> 70.25972, 45.38333, 45.08466, 69.44028, 68.28611, 77.0…
$ urbanization    <dbl> 48.059393, 18.611754, 37.539705, NA, 40.444164, 87.043…
$ type            <chr> "Everything Else", "Everything Else", "Everything Else…

The data

Among the variables, we have:

  • Entity - country name - character or string variable

  • life_expectancy - average life expectancy - double-precision floating-point variable

  • urbanization - percentage level of urbanization - double-precision floating-point variable

  • type - group of countries - EU, Latin America, Everything Else - character or string variable
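
The variable types and the missing value listed above can also be checked in base R. This is a small sketch on a toy copy of the first five rows printed earlier; the object name life_exp_urb_toy is an assumption made for illustration.

```r
# Toy copy of the first five rows shown in the head() output
life_exp_urb_toy <- data.frame(
  Entity = c("Aruba", "Afghanistan", "Angola", "Anguilla", "Albania"),
  life_expectancy = c(70.25972, 45.38333, 45.08466, 69.44028, 68.28611),
  urbanization = c(48.05939, 18.61175, 37.53970, NA, 40.44416),
  type = "Everything Else",
  stringsAsFactors = FALSE
)

# Storage type of each column ("character" or "double")
col_types <- vapply(life_exp_urb_toy, typeof, character(1))
print(col_types)

# Number of missing values (NA) in each column
na_counts <- colSums(is.na(life_exp_urb_toy))
print(na_counts)
```

In this toy copy only urbanization has a missing value (Anguilla), which matters later because rows with NA are dropped from a scatterplot.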

Goal

The final goal is to obtain a graph like this:

Creating the Graph

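The first step is to define a plot object, to which we will then add layers

Note: the code from here on uses a dataframe called final_new, which is not created in the code shown so far; we assume it contains the same variables as life_exp_urb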

ggplot(data = final_new)

Creating the Graph

The second step is to specify the mapping

The mapping argument of the ggplot() function defines how variables in your dataset are mapped to visual properties (aesthetics) of your plot.

Here, the mapping specifies which variables go on the x and y axes

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy))

Creating the Graph

The third step is to define a geometrical object - geom to plot the data

There are different types of geometrical objects

  • geom_bar() - bar geoms

  • geom_line() - line geoms

  • geom_boxplot() - boxplot geoms

  • geom_point() - point geoms

Creating the Graph

In our case, we will use geom_point()

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy))+
  geom_point()

Creating the Graph

We now have something that looks like a scatterplot

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy))+
  geom_point()

There appears to be a positive relationship between urbanization and life expectancy: countries with higher urbanization tend to have higher life expectancy

Creating the Graph

The fourth step is to add aesthetics

Could the relationship between urbanization and life expectancy depend on the type of countries: e.g. EU, Latin America, the rest of the world?

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy, color = type))+
  geom_point()

Creating the Graph

To make the relationship between urbanization and life expectancy within each group of countries clearer, we can add a fitted line per group (method = "lm" fits a straight line)

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy, color = type))+
  geom_point()+
  geom_smooth(method = "lm")

Creating the Graph

We can add a single global fitted line by using the following code:

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy))+
  geom_point(mapping = aes(color = type))+
  geom_smooth(method = "lm")

Creating the Graph

Note the difference between the two code snippets:

Original

ggplot(data = final_new, 
  mapping = aes(x=urbanization, 
                y=life_expectancy, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")

Changed

ggplot(data = final_new, 
  mapping = aes(x=urbanization, 
                y=life_expectancy))+
  geom_point(mapping = 
               aes(color = type))+
  geom_smooth(method = "lm")
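
The practical difference is where the color mapping lives: an aesthetic set in ggplot() is inherited by every layer, while one set inside geom_point() applies to the points only. The sketch below illustrates this on hypothetical toy data (the names toy, grp, and n_fitted_lines are made up for illustration) and counts how many lines geom_smooth() fits in each case.

```r
library(ggplot2)

# Hypothetical toy data with two groups
toy <- data.frame(
  x   = c(1, 2, 3, 4, 5, 6),
  y   = c(2, 3, 5, 4, 6, 8),
  grp = rep(c("A", "B"), each = 3)
)

# Color mapped globally: geom_smooth() inherits it and fits one line per group
p_global <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped only in geom_point(): geom_smooth() fits a single line
p_local <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = grp)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the distinct groups in the computed data of the second layer (the smooth)
n_fitted_lines <- function(p) length(unique(layer_data(p, 2)$group))

print(n_fitted_lines(p_global))
print(n_fitted_lines(p_local))
```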

Creating the Graph

To make the difference between different groups more obvious, we can choose different shapes.

Thus, we can also map type to the shape aesthetic in addition to color.

ggplot(data = final_new, 
  mapping = aes(x=urbanization, 
                y=life_expectancy))+
  geom_point(mapping = aes(color = type, shape = type))+
  geom_smooth(method = "lm")

Improving the Graph

We can improve our graph by adding labels with the labs() function

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy))+
  geom_point(mapping = aes(color = type, shape = type))+
  geom_smooth(method = "lm")+
  labs(title = "Urbanization and Life Expectancy",
       x = "Urbanization (pct.)", y = "Life Expectancy (years)",
       color = "Type", shape = "Type")

Barplots and Distributions

ggplot2 can also be used to visualize the distribution of categorical and numerical variables.

For example, type is a categorical variable: each country belongs to the EU, Latin America, or Everything Else.

Categorical variables can take on a limited number of possible values. Each observation is assigned to a particular group based on some qualitative property.

Barplots with Categorical Variables

This is how we can draw a barplot of the number of countries in each group

ggplot(final_new, aes(x = type)) +
  geom_bar()

Barplots with Categorical Variables

We can order the bars from most to least frequent with fct_infreq() from the forcats package

library(forcats)
ggplot(final_new, aes(x = fct_infreq(type))) +
  geom_bar()
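
To see what fct_infreq() changes, it helps to look at the factor levels directly, since geom_bar() draws the bars in level order. A small sketch on a hypothetical vector of group labels (type_toy is made up for illustration):

```r
library(forcats)

# Hypothetical group labels: 3 x "Everything Else", 2 x "Latin America", 1 x "EU"
type_toy <- factor(c("EU", "Everything Else", "Latin America",
                     "Everything Else", "Everything Else", "Latin America"))

# Default level order is alphabetical
default_levels <- levels(type_toy)
print(default_levels)

# fct_infreq() reorders the levels from most to least frequent
freq_levels <- levels(fct_infreq(type_toy))
print(freq_levels)
```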

Barplots with Numerical Variables: Histograms

Numerical variables can take on a wide range of numerical values

One common way to visualize a numerical variable is to plot its distribution

ggplot(final_new, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 1)

Barplots with Numerical Variables: Histograms

ggplot(final_new, aes(x = life_expectancy)) +
geom_histogram(binwidth = 1)

A histogram divides the x-axis (horizontal) into equally spaced bins

It then uses the height of a bar to display the number of observations that fall in each bin.

Barplots with Numerical Variables: Histograms

ggplot(final_new, aes(x = life_expectancy)) +
geom_histogram(binwidth = 1)

This histogram shows that the largest number of countries in the sample have an average life expectancy between roughly 66 and 68 years.

Barplots with Numerical Variables: Histograms

We can set the width of the intervals in a histogram with the binwidth argument.

This is measured in the units of the x variable.

Let us look at the difference between two bin widths

Barplots with Numerical Variables: Histograms

Binwidth: 1

ggplot(final_new, 
       aes(x = life_expectancy)) +
geom_histogram(binwidth = 1)

Binwidth: 10

ggplot(final_new, 
       aes(x = life_expectancy)) +
geom_histogram(binwidth = 10)
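
The effect of binwidth can also be checked numerically. The sketch below uses hypothetical life expectancy values (life_exp_toy and n_bins are made-up names) and counts the bins ggplot2 computes for each setting.

```r
library(ggplot2)

# Hypothetical life expectancy values, in years
life_exp_toy <- data.frame(
  life_expectancy = c(45.1, 45.4, 52.0, 61.5, 66.2, 67.1,
                      67.8, 68.3, 69.4, 70.3, 77.0)
)

p_narrow <- ggplot(life_exp_toy, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 1)

p_wide <- ggplot(life_exp_toy, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 10)

# Each row of the computed layer data corresponds to one bin
n_bins <- function(p) nrow(layer_data(p, 1))

print(n_bins(p_narrow))
print(n_bins(p_wide))
```

A narrow bin width shows more detail but more noise; a wide one hides detail but makes the overall shape easier to read. In both cases every observation is counted exactly once.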

Barplots with Numerical Variables: Density Plots

  • An alternative to histograms is a density plot

  • This is a smoothed-out version of a histogram

Histogram

ggplot(final_new, 
       aes(x = life_expectancy)) +
geom_histogram(binwidth = 1)

Density Plot

ggplot(final_new, 
       aes(x = life_expectancy)) +
geom_density()

Visualizing relationships: Boxplots

  • To visualize relationships, we need at least two variables mapped to aesthetics

  • A boxplot is a type of visual shorthand for measures of position (percentiles) that describe a distribution.
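
The percentiles that a boxplot summarizes can be computed directly. This is a sketch in base R using the five life expectancy values from the head() output shown earlier; the whisker rule (1.5 times the interquartile range from the box edges) is the default used by geom_boxplot().

```r
# Life expectancy of the first five entries in the data
x <- c(70.25972, 45.38333, 45.08466, 69.44028, 68.28611)

# The box: 25th percentile, median (50th), and 75th percentile
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# Interquartile range: the height of the box
iqr <- q[["75%"]] - q[["25%"]]

# Points beyond these limits are drawn individually as outliers
lower_limit <- q[["25%"]] - 1.5 * iqr
upper_limit <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_limit | x > upper_limit]
print(outliers)
```

With these five values the box runs from 45.38333 to 69.44028, the median is 68.28611, and no point is far enough from the box to count as an outlier.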

Visualizing relationships: Boxplots

Let us now look at our own dataset

ggplot(final_new, aes(x = type, y = life_expectancy)) +
  geom_boxplot()

Visualizing relationships: Boxplots

We can also create a density plot

ggplot(final_new, aes(x = life_expectancy, color = type))+
  geom_density(linewidth = 0.75)

Visualizing relationships: Boxplots

We can use the color and fill aesthetics to color the density curves by group, and the alpha argument to add transparency to the filled curves

ggplot(final_new, aes(x = life_expectancy, 
                      color = type,fill = type))+
  geom_density(linewidth = 0.75, alpha = 0.5)

Categorical Variables and Facets

Previously, we looked at one way to present the relationship between urbanization and life expectancy, group by group

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")

Categorical Variables and Facets

Another (more effective) way is to plot facets

ggplot(data = final_new, 
  mapping = aes(x=urbanization, y=life_expectancy))+
  geom_point(mapping = aes(color = type))+
  geom_smooth(method = "lm")+
  facet_wrap(~type)

Categorical Variables and Facets

Here is how the two compare

One Graph

ggplot(data = final_new, 
  mapping = aes(x=urbanization, 
                y=life_expectancy, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")

Facets

ggplot(data = final_new, 
  mapping = aes(x=urbanization, 
                y=life_expectancy))+
  geom_point(mapping = aes(color = type))+
  geom_smooth(method = "lm")+
  facet_wrap(~type)

Saving Plots

Once you have created your plot, you can save it to a file with ggsave()

We should first provide a name for the object

object <- ggplot(data = final_new, 
  mapping = aes(x=urbanization, 
                y=life_expectancy, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")

And then save that object

ggsave(filename = "figure1.jpg", plot = object)

This will be saved in your working directory.

Saving Plots

You can also be more precise about the dimensions of your figure

#Saving with explicit height, width, units, and resolution
ggsave(object,
       filename="figure2.jpg",
       height=25, width=20.2, 
       units = "cm", 
       dpi=300)
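
A self-contained sketch of the same call that can be run without the course data. The toy data, the object names, and the choice of a temporary folder and PDF output are assumptions made for illustration; ggsave() picks the file format from the file extension.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  urbanization    = c(18.6, 37.5, 40.4, 48.1),
  life_expectancy = c(45.4, 45.1, 68.3, 70.3)
)

p <- ggplot(toy, aes(x = urbanization, y = life_expectancy)) +
  geom_point()

# Write to a temporary folder so the working directory is left untouched
out_file <- file.path(tempdir(), "figure_sketch.pdf")

ggsave(filename = out_file, plot = p,
       height = 25, width = 20.2, units = "cm")

# Confirm that the file was written
print(file.exists(out_file))
```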

Temporal Data

You may also want to plot temporal data

To do that we need to load the life expectancy data over time

The data is available at: Life expectancy

setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')

Let us look again at the data

head(life_expectancy, n=3)
       Entity Code Year Life.expectancy.at.birth..historical.
1 Afghanistan  AFG 1950                                  27.7
2 Afghanistan  AFG 1951                                  28.0
3 Afghanistan  AFG 1952                                  28.4

Temporal Data

Let us imagine that we want to plot life expectancy over time

life_expectancy2<-life_expectancy%>%
  group_by(Year)%>%
  summarize(life_expectancy=mean(Life.expectancy.at.birth..historical.))

Let us look at the transformed data.

head(life_expectancy2, n=3)
# A tibble: 3 × 2
   Year life_expectancy
  <int>           <dbl>
1  1543            33.9
2  1548            38.8
3  1553            39.6
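
The same grouping step can be tried on a tiny table to see exactly what it does. This sketch uses hypothetical values (only the shape of the data matches the file above) and adds two things that are not in the original code: na.rm = TRUE, so that a missing value does not turn the yearly mean into NA, and a count of how many countries contribute to each year.

```r
library(dplyr)

# Hypothetical toy data: two countries observed in two years, one value missing
toy <- data.frame(
  Year     = c(1950, 1950, 1951, 1951),
  life_exp = c(27.7, 60.3, 28.0, NA)
)

yearly <- toy %>%
  group_by(Year) %>%
  summarize(
    mean_life_exp = mean(life_exp, na.rm = TRUE),  # average over available countries
    n_countries   = sum(!is.na(life_exp))          # how many countries were averaged
  )

print(yearly)
```

Keeping the count next to the mean makes it easy to spot years where the average rests on very few countries.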

Temporal Data

We can now create a simple time plot of life expectancy over time

ggplot(life_expectancy2, aes(x = Year, y = life_expectancy)) +
  geom_line()

Up until 1950, there is a lot of volatility in life expectancy

Temporal Data

To see the upward trend in life expectancy more clearly, we can add a smoothed trend line with geom_smooth().

ggplot(life_expectancy2, aes(x = Year, y = life_expectancy)) +
  geom_line()+
  geom_smooth()

Temporal Data: Two Countries

Let us imagine that we want to compare Italy and the US in life expectancy

This means we go back to the original file and select the US and Italy

setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')
#Step2: Subsetting the data
life_expectancy_itus<-subset(life_expectancy, Entity %in% c("Italy", "United States"))
#Step3: Renaming Variables
names(life_expectancy_itus)[names(life_expectancy_itus)=="Life.expectancy.at.birth..historical."] <- "life_exp"
head(life_expectancy_itus, n=4)
     Entity Code Year life_exp
8555  Italy  ITA 1872    29.70
8556  Italy  ITA 1873    31.61
8557  Italy  ITA 1874    31.76
8558  Italy  ITA 1875    31.33

Temporal Data: Two Countries

We can now plot life expectancy for the two countries over time

#Step4: Plotting life expectancy over time, one line per country
ggplot(life_expectancy_itus, 
       aes(x=Year, y= life_exp, color = Entity))+
  geom_line()